Jeremiah Yarmie - 7712088
MBIO 7160 - Special Problems in Microbiology: Advanced Biostatistics
Final Project
Why may the phylogenetic approach used in the R package treeWAS, which is based on phylogenetics and hypothesis testing, be more appropriate for conducting bacterial genome-wide association studies (GWAS) when compared to other approaches that incorporate multivariate analysis/principal component analysis (PCA) and linear mixed models (LMM)?
Collins and Didelot, the authors of the tool, claim that treeWAS outperforms these other approaches in both terms of not only statistical power, but also in the appropriateness of the approach when considering bacterial population structure and events of homologous recombination.
To explore this claim, treeWAS was run on a genomic data set of Mycobacterium tuberculosis, a clonal bacterial pathogen that undergoes almost no recombination, that had been previously analyzed and published using these a PCA-LMM GWAS approach, the results of which will be compared.
GWAS attempts to identify causative relationships between genetic variants and phenotypic outcomes within a population. It is an inherently statistical approach, concerned with maximizing precision and power in its analysis.
Bacterial GWAS is a top-down approach to conducting genetic research, compared to bottom-up approaches rooted in molecular biology like creating knock outs, mutant strains, etc.
One approach to GWAS is an allele-counting method, which looks for an increased presence of a certain allele at a loci in cases relative to controls. This contrasts with homoplasy approaches, which look at the occurrence of repeated and independently forming mutations occurring on branches of cases, relative to controls, on a phylogenetic tree. Put simply, homoplasy is the presence of similar genetic loci on different branches of a phylogenetic tree. treeWAS, for example is a homoplasy approach to GWAS, whereas PCA-LMM is an allele-counting approach.
Homoplasy approaches account for the effects of population structure and linkage disequilibrium inherently in its design. As well, homoplasy counting requires a much smaller sample size to reach statistical significance compared to allele counting.
A key issue when conducting bacterial GWAS is the confounding effect of genetic relatedness between different bacterial isolates and strains. This is due to the fact that there exist distinct lineages of clonally reproducing bacterial strains, which can result in suprious associations arising in association studies. This results in higher false positive of type I error rates.
Unlike with humans, almost the entire bacterial chromosome is under linkage disequilibrium. What results is a patchwork of recombined regions on a tract of linked regions called a clonal frame, where all regions of the clonal frame are in linkage disequilibrium.
This results in it being difficult to distinguish potential markers from the other accumulated variants found on the chromosome that would be correlated with phenotypes. Ignoring the clonal aspect of bacteria entirely would result in your identifying thousands of associations.
Population structure as a result of clonal reproduction can result in population stratification, where certain subgroups of closely-related individuals can give rise to spurious associations with phenotypes of interest. This is particularly a problem in highly clonal and rarely recombining bacteria like Mycobacterium tuberculosis. Stratification is seen if members of a population and subpopulation structure contain a non-random distribution of alleles.
While population structure and clonal frames make it difficult to separate true associations from non-causal ones, all standard GWAS approaches perform better when there are larger sample sizes. They also perform more poorly on highly clonal and rarely recombining organism genomes.
Genome-wide linkage disequilibrium as a result of a strong clonal frame can lead to type I errors (false positives). As such, genomes for bacteria like Mycobacterium tuberculosis are at a large risk for type I error.
PCA is one approach to correct for population structure and stratification when conducing GWAS analysis. The largest principal components tend to correspond to major population structures and strain lineages or clonal frames.
Most GWAS approaches using PCA to account for population structure conduct downstream LMM analysis because of their compatibility. LMM are an extension of linear regression that include both fixed and random effects. Principal components identified that consider population structure are then chosen as fixed effect factors in downstream linear regression. This approach controls for type I error with less power loss.
Approaches to correct for population structure, including PCA, reduce the statistical power of GWAS considerably. Conducting GWAS using PCA has a sensitivity-specificity trade-off, where your statistical power decreases considerably if you correct for population structure and reduce type I error. The loss of power when conducting PCA to account for population structure occurs because of differences in strains contributing substantial phenotypic variability.
The authors of the treeWAS package claim that their approach balances a low false positive rate, high sensitivity, and high positive predictive value. This contrasts with multivariate analysis approaches, which trade strong sensitivity for a weak positive predictive value, and cluster-based approaches, which perform moderately well in both positive predictive value and sensitivity, but do not factor in the clonal population structure of bacteria, as well as the occurrence of homologous recombination.
treeWAS was designed to be appropriate to use both with highly clonal bacteria and extremely recombinant ones (through the implementation of the recombination detection tool ClonalFrameML).
When compared to other cluster-based or dimensional reduction GWAS approaches, treeWAS performs better in terms of both power and precision on simulated data.
treeWAS uses a null genetic dataset to conduct hypothesis tests on the validity of associations seen in the test data. This null dataset maintains the following elements from the test dataset: clonal genealogy, terminal phenotype, genetic composition and homoplasy (substitutions per site due to both mutation and recombination).
The null dataset does not recreate any of the “true” associations from the test dataset, except ones expected to arise by random chance or due to one of our known confounding factors like recombination and population structure.
Comparisons between genotype-phenotype associations in the null dataset and the test dataset are conducted to identify associations that have statistical and phylogenetic basis.
Step 1: Phylogenetic reconstruction (however, you can input a tree within the tool’s parameters)
Step 2: Compute the homoplasy distribution of your test data using the genetic data.
Step 3: Simulate null genetic dataset
where:
- the original phenotype is maintained across the tree leaves
- variants occur on different branches
- the simulated dataset resembles the test dataset in terms of population structure and genetic composition, but not association with phenotypes
Step 4: Ancestral state estimation
- to determine the likely genetic composition of nodes in the tree, for the purpose of conducting the association tests
Step 5: Three different association tests are applied to each loci
- treeWAS utilizes three scores of association to enhance the statistical power of the tool
Firstly, a null distribution of the three association score statistics is calculated by measuring associations between the null loci and the true phenotypes. These represent the null hypothesis of treeWAS’s hypothesis tests.
Then the three association tests are done on our test dataset.
Test 1:
- The terminal score considers tree tip states.
Test 2:
-The simultaneous score considers ancestral state reconstruction.
-Measures the degree of parallel change in genotype and phenotype across tree branches.
-Counts the number of branches containing a simultaneous substitution in both genotype and phenotype.
-Simultaneous substitutions indicate a strong relationship between genotype and phenotype.
-Able to detect associations arising through similar or complementary pathways.
Test 3:
-The subsequent score measures the proportion of the entire tree in which genotype and phenotype coexist.
-Allows us to measure in what proportion of tree branches we expect the genotype and phenotype to be in the same state.
Step 6: Finally, a significance threshold is drawn after correcting for multiple testing, considering both the multiple loci and the three association tests used.
-By accounting for the entire tree, the authors of treeWAS have the ability to utilize three different association tests, greatly increasing the statistical power of the tool.
GWAS approaches can be compared to each other based on their statistical power, that is how accurately the approach can identify associations between genotypes and phenotypes, and its ability to account for spurious associations, reducing its type I error rate.
While PCA GWAS approaches have a rather high sensitivity, being able to detect associations well, it has a weaker positive predictive rate compared to treeWAS due to a higher false positive rate.
The two datasets originally selected to be used in this project, collections of Helicobacter pylori and Mycobacterium tuberculosis genomes, were chosen to be representative of the two ends of the spectrum when it comes to genetic variability and plasticity, since treeWAS is adept at conducting GWAS for both recombinant and clonal bacterial species. As well, these two datasets were taken from GWAS studies that both implemented a PCA-LMM approach.
The Mycobacterium tuberculosis dataset was corrected for population stratification using dimensional reduction before being analyzed with the software package GEMMA which incorporates a LMM approach. Mycobacterium tuberculosis only has 1.1% homoplasic SNPs within its core genome. The phenotypes associated with the Mycobacterium tuberculosis dataset are the various antimicrobial resistance and susceptibility measurements.
I also intended on conducting a GWAS analysis of a Helicobacter pylori dataset from a previous study that included PCA via Eigenvalue Decomposition before identifying SNVs, indels, and gene presence/absence using a k-mer method in bugwas, and conducting a LMM approach. The phenotypes associated with the Helicobacter pylori dataset were the degree of stomach disease progression in patients from which the strains were isolated.
However, after much attempt (any many job failures) to prepare the genomic data for input into treeWAS, I decided that it was too computationally intensive to implement in this project.
The issues began with the way the data was stored and publically available itself. In Berthenet et al.’s 2018 paper, they stated that all of the data used in their GWAS study was publically available on National Centre for Biotechnology Information’s Short Read Archive, but that wasn’t the case. Instead, draft genomes in contigs were deposited in the BioProject database without short read files. Despite multiple attempts to produce a multi-aligned FASTA file of the 170 genomes in contigs using various tools including clustal omega, clustal w, and MAFFT, all of the tools failed after several days of running, including on a supercomputer cluster with 64 CPUs and 512 GB of memory. It was after these attempts that I decided to drop the Helicobacter pylori dataset.
A small subset of the several-thousand genome dataset used in Farhat et al.’s 2019 Mycobacterium tuberculosis GWAS study was done for two reasons. Firstly, it was chosen to be of comparable size to the 170 Helicobacter pylori genomes I was originally intending to analyze. Secondly, this small genome subset was chosen to be able to fit within the short timeframe and computational manner of a class project. Finally, this subset was chosen to see if it would be a barrier to identifying similar loci using the homoplasy approach of treeWAS, since homoplasy GWAS approaches require fewer isolates to reach statistical significance when compared to PCA-LMM approches.
If treeWAS truly has the precision and power that the authors claim it does, I expected that the results of the GWAS analysis would be similar to the PCA/LMM approaches, in terms of identified loci, despite the significantly smaller genome dataset size.
The Mycobacterium tuberculosis phenotypic metadataset was taken from Farhat’s study and manipulated in R using tidyverse tools to prepare it for input into treeWAS.
#some data massaging to prepare the phenotypic data for input into treeWAS
mtb_meta <- select(mtb_meta_raw, isolates, INH, RIF, EMB, STR, CIP, CYS, CAP, ETA, KAN, OFLX, PAS, PZA, AMK, MOXI, PRO, CLO, RBU, LEVO) #pruning down the metadata tree to just include isolates and AMR phenotypes
mtb_meta[mtb_meta == "NULL"] <- NA #the dataset originally came with NULLs instead of NAs
mtb_isos_meta <- semi_join(mtb_meta, mtb_isos_raw, by = "isolates") #selecting our subset of isolates
#the next section exports each AMR phenotypic column as a vector and associates it to the respective isolates
###i probably could have done this with purrr?
mtb_phen_inh <- as_vector(mtb_isos_meta$INH)
names(mtb_phen_inh) <- mtb_isos_meta$isolates
mtb_phen_rif <- as_vector(mtb_isos_meta$RIF)
names(mtb_phen_rif) <- mtb_isos_meta$isolates
mtb_phen_emb <- as_vector(mtb_isos_meta$EMB)
names(mtb_phen_emb) <- mtb_isos_meta$isolates
mtb_phen_str <- as_vector(mtb_isos_meta$STR)
names(mtb_phen_str) <- mtb_isos_meta$isolates
mtb_phen_cip <- as_vector(mtb_isos_meta$CIP)
names(mtb_phen_cip) <- mtb_isos_meta$isolates
mtb_phen_cys <- as_vector(mtb_isos_meta$CYS)
names(mtb_phen_cys) <- mtb_isos_meta$isolates
mtb_phen_cap <- as_vector(mtb_isos_meta$CAP)
names(mtb_phen_cap) <- mtb_isos_meta$isolates
mtb_phen_eta <- as_vector(mtb_isos_meta$ETA)
names(mtb_phen_eta) <- mtb_isos_meta$isolates
mtb_phen_kan <- as_vector(mtb_isos_meta$KAN)
names(mtb_phen_kan) <- mtb_isos_meta$isolates
mtb_phen_oflx <- as_vector(mtb_isos_meta$OFLX)
names(mtb_phen_oflx) <- mtb_isos_meta$isolates
mtb_phen_pas <- as_vector(mtb_isos_meta$PAS)
names(mtb_phen_pas) <- mtb_isos_meta$isolates
mtb_phen_pza <- as_vector(mtb_isos_meta$PZA)
names(mtb_phen_pza) <- mtb_isos_meta$isolates
mtb_phen_amk <- as_vector(mtb_isos_meta$AMK)
names(mtb_phen_amk) <- mtb_isos_meta$isolates
mtb_phen_moxi <- as_vector(mtb_isos_meta$MOXI)
names(mtb_phen_moxi) <- mtb_isos_meta$isolates
mtb_phen_pro <- as_vector(mtb_isos_meta$PRO)
names(mtb_phen_pro) <- mtb_isos_meta$isolates
mtb_phen_clo <- as_vector(mtb_isos_meta$CLO)
names(mtb_phen_clo) <- mtb_isos_meta$isolates
mtb_phen_rbu <- as_vector(mtb_isos_meta$RBU)
names(mtb_phen_rbu) <- mtb_isos_meta$isolates
mtb_phen_levo <- as_vector(mtb_isos_meta$LEVO)
names(mtb_phen_levo) <- mtb_isos_meta$isolates
The Mycobacterium tuberculosis short reads from illumina sequencing were downloaded from the NCBI’s Short Read Archive before being run through the PHAC-NML’s SNVPhyl single nucleotide variant analysis pipeline.
In short, SNVPhyl maps sequence reads to a reference genome, identifying variants, before filtering them through user-defined parameters, before creating a SNV alignment, a SNV distance matrix, filter and quality statistics, and a phylogenetic tree.
The two outputs of SNVPhyl that are relevant for running treeWAS are:
- a multi-aligned FASTA file of all single nucleotide variants amongst your query genomes.
- a phylogenetic tree in the newick file format.
These files were imported into R with minimal processing before treeWAS was able to be called.
#processing of the DNA sequence data
#next we import our genetic data
mtb_dna <- read.dna(file = here("mtb_data/mtb_snv_align.fasta"), format = "fasta") #import our multi-aligned FASTA of all isolate genome SNVs
mtb_snps <- DNAbin2genind(mtb_dna)@tab #converts all SNVs to a DNA bin matrix
mtb_tree <- read.tree(file = here("mtb_data/mtb_tree.nhx")) #import out phylogenetic tree created with FastTree
Running treeWAS is simple, with a single line of code being sufficient to run on default parameters. Defaults of note include a p-value threshold of 0.01 and Bonferroni multiple test correcting being turned on.
Several of the phenotypic metadatasets had insufficient data to conduct GWAS, often there were identified resistant strains but not sensitive ones, or the entire metadata vector was full of NAs.
The phenotypic information for levofloxacin, rifabutin, cloxacillin, moxicillin, amoxicillin, pyrazinamide, para-aminosalicylic acid, and ciprofloxacin did not have sufficient information to conduct GWAS.
#run treeWAS
inh <- treeWAS(snps = mtb_snps,
phen = mtb_phen_inh,
tree = mtb_tree,
seed = 1)
## Warning in treeWAS(snps = mtb_snps, phen = mtb_phen_inh, tree = mtb_tree, : The phenotypic variable for individual(s) reference is missing.
## Removing these individuals from the analysis.
## [1] "treeWAS snps sim done @ 2019-11-28 16:57:14"
## [1] "Reconstructions completed @ 2019-11-28 16:57:25"
## [1] "Started running terminal test @ 2019-11-28 16:57:25"
## [1] "Real data scores completed for terminal test @ 2019-11-28 16:57:25"
## [1] "Simulated data scores completed for terminal test @ 2019-11-28 16:57:25"
## [1] "Started running simultaneous test @ 2019-11-28 16:57:25"
## [1] "Real data scores completed for simultaneous test @ 2019-11-28 16:57:25"
## [1] "Simulated data scores completed for simultaneous test @ 2019-11-28 16:57:25"
## [1] "Started running subsequent test @ 2019-11-28 16:57:25"
## [1] "Real data scores completed for subsequent test @ 2019-11-28 16:57:25"
## [1] "Simulated data scores completed for subsequent test @ 2019-11-28 16:57:25"
## [1] "Finished running terminal test @ 2019-11-28 16:57:26"
## [1] "Finished running simultaneous test @ 2019-11-28 16:57:26"
## [1] "Finished running subsequent test @ 2019-11-28 16:57:26"
## [1] "ID of significant loci completed @ 2019-11-28 16:57:26"
rif <- treeWAS(snps = mtb_snps,
phen = mtb_phen_rif,
tree = mtb_tree,
seed = 1)
## Warning in treeWAS(snps = mtb_snps, phen = mtb_phen_rif, tree = mtb_tree, : The phenotypic variable for individual(s) reference is missing.
## Removing these individuals from the analysis.
## [1] "treeWAS snps sim done @ 2019-11-28 16:59:48"
## [1] "Reconstructions completed @ 2019-11-28 17:00:06"
## [1] "Started running terminal test @ 2019-11-28 17:00:06"
## [1] "Real data scores completed for terminal test @ 2019-11-28 17:00:06"
## [1] "Simulated data scores completed for terminal test @ 2019-11-28 17:00:06"
## [1] "Started running simultaneous test @ 2019-11-28 17:00:06"
## [1] "Real data scores completed for simultaneous test @ 2019-11-28 17:00:06"
## [1] "Simulated data scores completed for simultaneous test @ 2019-11-28 17:00:06"
## [1] "Started running subsequent test @ 2019-11-28 17:00:06"
## [1] "Real data scores completed for subsequent test @ 2019-11-28 17:00:06"
## [1] "Simulated data scores completed for subsequent test @ 2019-11-28 17:00:06"
## [1] "Finished running terminal test @ 2019-11-28 17:00:08"
## [1] "Finished running simultaneous test @ 2019-11-28 17:00:08"
## [1] "Finished running subsequent test @ 2019-11-28 17:00:08"
## [1] "ID of significant loci completed @ 2019-11-28 17:00:08"
emb <- treeWAS(snps = mtb_snps,
phen = mtb_phen_emb,
tree = mtb_tree,
seed = 1)
## Warning in treeWAS(snps = mtb_snps, phen = mtb_phen_emb, tree = mtb_tree, : The phenotypic variable for individual(s) reference is missing.
## Removing these individuals from the analysis.
## [1] "treeWAS snps sim done @ 2019-11-28 17:02:37"
## [1] "Reconstructions completed @ 2019-11-28 17:02:48"
## [1] "Started running terminal test @ 2019-11-28 17:02:48"
## [1] "Real data scores completed for terminal test @ 2019-11-28 17:02:48"
## [1] "Simulated data scores completed for terminal test @ 2019-11-28 17:02:48"
## [1] "Started running simultaneous test @ 2019-11-28 17:02:48"
## [1] "Real data scores completed for simultaneous test @ 2019-11-28 17:02:48"
## [1] "Simulated data scores completed for simultaneous test @ 2019-11-28 17:02:48"
## [1] "Started running subsequent test @ 2019-11-28 17:02:48"
## [1] "Real data scores completed for subsequent test @ 2019-11-28 17:02:48"
## [1] "Simulated data scores completed for subsequent test @ 2019-11-28 17:02:48"
## [1] "Finished running terminal test @ 2019-11-28 17:02:49"
## [1] "Finished running simultaneous test @ 2019-11-28 17:02:49"
## [1] "Finished running subsequent test @ 2019-11-28 17:02:49"
## [1] "ID of significant loci completed @ 2019-11-28 17:02:49"
str <- treeWAS(snps = mtb_snps,
phen = mtb_phen_str,
tree = mtb_tree,
seed = 1)
## Warning in treeWAS(snps = mtb_snps, phen = mtb_phen_str, tree = mtb_tree, : The phenotypic variable for individual(s) c("SRR2467267", "SRR2467268", "SRR2467269", "reference") is missing.
## Removing these individuals from the analysis.
## [1] "treeWAS snps sim done @ 2019-11-28 17:05:02"
## [1] "Reconstructions completed @ 2019-11-28 17:05:13"
## [1] "Started running terminal test @ 2019-11-28 17:05:13"
## [1] "Real data scores completed for terminal test @ 2019-11-28 17:05:13"
## [1] "Simulated data scores completed for terminal test @ 2019-11-28 17:05:13"
## [1] "Started running simultaneous test @ 2019-11-28 17:05:13"
## [1] "Real data scores completed for simultaneous test @ 2019-11-28 17:05:13"
## [1] "Simulated data scores completed for simultaneous test @ 2019-11-28 17:05:13"
## [1] "Started running subsequent test @ 2019-11-28 17:05:13"
## [1] "Real data scores completed for subsequent test @ 2019-11-28 17:05:13"
## [1] "Simulated data scores completed for subsequent test @ 2019-11-28 17:05:13"
## [1] "Finished running terminal test @ 2019-11-28 17:05:14"
## [1] "Finished running simultaneous test @ 2019-11-28 17:05:14"
## [1] "Finished running subsequent test @ 2019-11-28 17:05:14"
## [1] "ID of significant loci completed @ 2019-11-28 17:05:14"
#cip has insufficient phenotypic data
#cip <- treeWAS(snps = mtb_snps,
#phen = mtb_phen_cip,
#tree = mtb_tree,
#seed = 1)
#cys has insufficient phenotypic data
#cys <- treeWAS(snps = mtb_snps,
#phen = mtb_phen_cys,
#tree = mtb_tree,
#seed = 1)
cap <- treeWAS(snps = mtb_snps,
phen = mtb_phen_cap,
tree = mtb_tree,
seed = 1)
## Warning in treeWAS(snps = mtb_snps, phen = mtb_phen_cap, tree = mtb_tree, : The phenotypic variable for individual(s) c("SRR057510", "SRR057595", "SRR057610", "SRR057619", "SRR057734", "SRR057768", "SRR057770", "SRR057771", "SRR058116", "SRR058369", "SRR058370", "SRR058371", "SRR058372", "SRR058373", "SRR058377", "SRR058399", "SRR058417", "SRR2467268", "SRR2467269", "SRR671776", "SRR671777", "reference") is missing.
## Removing these individuals from the analysis.
## [1] "treeWAS snps sim done @ 2019-11-28 17:07:35"
## [1] "Reconstructions completed @ 2019-11-28 17:07:49"
## [1] "Started running terminal test @ 2019-11-28 17:07:49"
## [1] "Real data scores completed for terminal test @ 2019-11-28 17:07:49"
## [1] "Simulated data scores completed for terminal test @ 2019-11-28 17:07:50"
## [1] "Started running simultaneous test @ 2019-11-28 17:07:50"
## [1] "Real data scores completed for simultaneous test @ 2019-11-28 17:07:50"
## [1] "Simulated data scores completed for simultaneous test @ 2019-11-28 17:07:50"
## [1] "Started running subsequent test @ 2019-11-28 17:07:50"
## [1] "Real data scores completed for subsequent test @ 2019-11-28 17:07:50"
## [1] "Simulated data scores completed for subsequent test @ 2019-11-28 17:07:50"
## [1] "Finished running terminal test @ 2019-11-28 17:07:51"
## [1] "Finished running simultaneous test @ 2019-11-28 17:07:51"
## [1] "Finished running subsequent test @ 2019-11-28 17:07:51"
## [1] "ID of significant loci completed @ 2019-11-28 17:07:51"
eta <- treeWAS(snps = mtb_snps,
phen = mtb_phen_eta,
tree = mtb_tree,
seed = 1)
## Warning in treeWAS(snps = mtb_snps, phen = mtb_phen_eta, tree = mtb_tree, : The phenotypic variable for individual(s) c("SRR057510", "SRR057595", "SRR057610", "SRR057619", "SRR057734", "SRR057768", "SRR057770", "SRR057771", "SRR058116", "SRR058369", "SRR058370", "SRR058371", "SRR058372", "SRR058373", "SRR058377", "SRR058399", "SRR058417", "SRR2467267", "SRR2467268", "SRR2467269", "SRR671719", "SRR671720", "SRR671721", "SRR671722", "SRR671724", "SRR671776", "SRR671777", "SRR671779", "SRR671780", "SRR671865", "SRR671866", "SRR671867", "SRR671868", "SRR671869", "reference") is missing.
## Removing these individuals from the analysis.
## [1] "treeWAS snps sim done @ 2019-11-28 17:10:06"
## [1] "Reconstructions completed @ 2019-11-28 17:10:14"
## [1] "Started running terminal test @ 2019-11-28 17:10:14"
## [1] "Real data scores completed for terminal test @ 2019-11-28 17:10:14"
## [1] "Simulated data scores completed for terminal test @ 2019-11-28 17:10:14"
## [1] "Started running simultaneous test @ 2019-11-28 17:10:14"
## [1] "Real data scores completed for simultaneous test @ 2019-11-28 17:10:14"
## [1] "Simulated data scores completed for simultaneous test @ 2019-11-28 17:10:14"
## [1] "Started running subsequent test @ 2019-11-28 17:10:14"
## [1] "Real data scores completed for subsequent test @ 2019-11-28 17:10:14"
## [1] "Simulated data scores completed for subsequent test @ 2019-11-28 17:10:15"
## [1] "Finished running terminal test @ 2019-11-28 17:10:15"
## [1] "Finished running simultaneous test @ 2019-11-28 17:10:15"
## [1] "Finished running subsequent test @ 2019-11-28 17:10:15"
## [1] "ID of significant loci completed @ 2019-11-28 17:10:15"
kan <- treeWAS(snps = mtb_snps,
phen = mtb_phen_kan,
tree = mtb_tree,
seed = 1)
## Warning in treeWAS(snps = mtb_snps, phen = mtb_phen_kan, tree = mtb_tree, : The phenotypic variable for individual(s) c("SRR057510", "SRR057595", "SRR057610", "SRR057619", "SRR057734", "SRR057768", "SRR057770", "SRR057771", "SRR058116", "SRR058369", "SRR058370", "SRR058371", "SRR058372", "SRR058373", "SRR058377", "SRR058399", "SRR058417", "SRR2467268", "SRR2467269", "SRR671776", "SRR671777", "SRR671779", "SRR671780", "SRR671783", "reference") is missing.
## Removing these individuals from the analysis.
## [1] "treeWAS snps sim done @ 2019-11-28 17:12:31"
## [1] "Reconstructions completed @ 2019-11-28 17:12:41"
## [1] "Started running terminal test @ 2019-11-28 17:12:41"
## [1] "Real data scores completed for terminal test @ 2019-11-28 17:12:41"
## [1] "Simulated data scores completed for terminal test @ 2019-11-28 17:12:41"
## [1] "Started running simultaneous test @ 2019-11-28 17:12:41"
## [1] "Real data scores completed for simultaneous test @ 2019-11-28 17:12:41"
## [1] "Simulated data scores completed for simultaneous test @ 2019-11-28 17:12:41"
## [1] "Started running subsequent test @ 2019-11-28 17:12:41"
## [1] "Real data scores completed for subsequent test @ 2019-11-28 17:12:41"
## [1] "Simulated data scores completed for subsequent test @ 2019-11-28 17:12:41"
## [1] "Finished running terminal test @ 2019-11-28 17:12:42"
## [1] "Finished running simultaneous test @ 2019-11-28 17:12:42"
## [1] "Finished running subsequent test @ 2019-11-28 17:12:42"
## [1] "ID of significant loci completed @ 2019-11-28 17:12:42"
oflx <- treeWAS(snps = mtb_snps,
phen = mtb_phen_oflx,
tree = mtb_tree,
seed = 1)
## Warning in treeWAS(snps = mtb_snps, phen = mtb_phen_oflx, tree = mtb_tree, : The phenotypic variable for individual(s) c("SRR057510", "SRR057595", "SRR057610", "SRR057619", "SRR057734", "SRR057768", "SRR057770", "SRR057771", "SRR058116", "SRR058369", "SRR058370", "SRR058371", "SRR058372", "SRR058373", "SRR058377", "SRR058399", "SRR058417", "SRR2467268", "SRR2467269", "SRR671776", "SRR671777", "reference") is missing.
## Removing these individuals from the analysis.
## [1] "treeWAS snps sim done @ 2019-11-28 17:14:55"
## [1] "Reconstructions completed @ 2019-11-28 17:15:05"
## [1] "Started running terminal test @ 2019-11-28 17:15:05"
## [1] "Real data scores completed for terminal test @ 2019-11-28 17:15:05"
## [1] "Simulated data scores completed for terminal test @ 2019-11-28 17:15:05"
## [1] "Started running simultaneous test @ 2019-11-28 17:15:05"
## [1] "Real data scores completed for simultaneous test @ 2019-11-28 17:15:05"
## [1] "Simulated data scores completed for simultaneous test @ 2019-11-28 17:15:05"
## [1] "Started running subsequent test @ 2019-11-28 17:15:05"
## [1] "Real data scores completed for subsequent test @ 2019-11-28 17:15:05"
## [1] "Simulated data scores completed for subsequent test @ 2019-11-28 17:15:05"
## [1] "Finished running terminal test @ 2019-11-28 17:15:06"
## [1] "Finished running simultaneous test @ 2019-11-28 17:15:06"
## [1] "Finished running subsequent test @ 2019-11-28 17:15:06"
## [1] "ID of significant loci completed @ 2019-11-28 17:15:06"
#pas has insufficient phenotypic data
#pas <- treeWAS(snps = mtb_snps,
#phen = mtb_phen_pas,
#tree = mtb_tree,
#seed = 1)
#pza has insufficient phenotypic data
#pza <- treeWAS(snps = mtb_snps,
#phen = mtb_phen_pza,
#tree = mtb_tree,
#seed = 1)
#amk has insufficient phenotypic data
#amk <- treeWAS(snps = mtb_snps,
#phen = mtb_phen_amk,
#tree = mtb_tree,
#seed = 1)
#moxi has insufficient phenotypic data
#moxi <- treeWAS(snps = mtb_snps,
#phen = mtb_phen_moxi,
#tree = mtb_tree,
#seed = 1)
#pro has insufficient phenotypic data
#pro <- treeWAS(snps = mtb_snps,
#phen = mtb_phen_pro,
#tree = mtb_tree,
#seed = 1)
#clo has insufficient phenotypic data
#clo <- treeWAS(snps = mtb_snps,
#phen = mtb_phen_clo,
#tree = mtb_tree,
#seed = 1)
#rbu has insufficient phenotypic data
#rbu <- treeWAS(snps = mtb_snps,
#phen = mtb_phen_rbu,
#tree = mtb_tree,
#seed = 1)
#levo has insufficient phenotypic data
#levo <- treeWAS(snps = mtb_snps,
#phen = mtb_phen_levo,
#tree = mtb_tree,
#seed = 1)
Of capreomycin, ethambutol, ethionamide, isoniazid, kanamycin, ofloxacin, rifampin, and streptomycin, the results can be summarized in the following table:
loci <- c(3696, 3833, 7296, 8943, 17249)
rifR <- c(T, T, F, T, F)
oflxR <- c(T, F, F, F, F)
kanR <- c(T, F, F, F, F)
inhR <- c(T, F, F, T, F)
etaR <- c(T, T, T, F, F)
embR <- c(T, T, F, F, T)
strR <- c(F, T, F, F, F)
pca_lmm_GWAS <- c(T, F, T, F, T)
gene <- c("rpoB", "rpsL", "fabG1", "lldD2", "embA")
annotation <- c("RNA polB","small ribosomal protein","3-oxoacyl-ACP reductase","L-lactate dehydrogenase","arabinosyltransferase")
mtb_results <- tibble(loci, rifR, oflxR, kanR, inhR, etaR, embR, strR, pca_lmm_GWAS, gene, annotation)
mtb_results
## # A tibble: 5 x 11
## loci rifR oflxR kanR inhR etaR embR strR pca_lmm_GWAS gene annotation
## <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <chr> <chr>
## 1 3696 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE rpoB RNA polB
## 2 3833 TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE rpsL small ribo…
## 3 7296 FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE fabG1 3-oxoacyl-…
## 4 8943 TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE lldD2 L-lactate …
## 5 17249 FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE embA arabinosyl…
Despite using a dataset an order of magnitude smaller than the initial study, treeWAS was still able to identify five loci that had statistically significant association with various antimicrobial resistance phenotypic traits. Of those five loci, two were ones that were not identified through a PCA-LMM GWAS approach. This presents, at the very least, a potential complimentary aspect this tool may have when conducting other forms of GWAS, if not also treeWAS’s ability to perform well conducting GWAS with a smaller sample size, a known strength of homoplasy approaches.
The three loci that were identified by both treeWAS and PCA-LMM GWAS are well-known and characterized resistance markers for Mycobacterium tuberculosis, which further validates the results of this exploratory project.
The issues conducting GWAS on the Helicobacter pylori genomic dataset were disappointing. As the bacterium is the subject of my own Master’s research project, it is upsetting that I was not able to find a way to make the data analysis work. Further, Berthenet et al.’s 2018 paper is the only instance of GWAS being conducting on Helicobacter pylori. The exciting aspect of this, however, is that my familiarness with treeWAS might mean that I may be able to leverage this tool in my own graduate research project.
treeWAS is an extremely easy to use R package. Once a user is familiar with the format that the input files need to be in, it doesn’t take very long or much coding and data manipulation to run the tool, which runs extremely quickly, in just a matter of minutes. It is also very easy to use specific arguments to change parameters of the tool, and the output is easy to follow and understand. One improvement I would suggest for the tool is reporting p-values with precision, rather than just indicating whether they are smaller than a Bonferroni corrected threshold of 10^-5.
Ultimately, I believe that treeWAS is a tool and package of high quality, something that I expect from the Didelot lab, and I am extremely happy that I was able to become familiar and explore the tool with this project.
Berthenet, E., Yahara, K., Thorell, K., Pascoe, B., Meric, G., Mikhail, J. M., … Sheppard, S. K. (2018). A GWAS on Helicobacter pylori strains points to genetic variants associated with gastric cancer risk. BMC biology, 16(1), 84. doi:10.1186/s12915-018-0550-3
Chen, P. E., and Shapiro, B. J. (2015) The advent of genome-wide association studies for bacteria. Curr Opin Microbiol. 25:17-24. doi: 10.1016/j.mib.2015.03.002.
Collins, C., and Didelot, X. (2018) A phylogenetic method to perform genome-wide association studies in microbes that accounts for population structure and recombination. PLOS Computational Biology 14(2): e1005958. https://doi.org/10.1371/journal.pcbi.1005958
Earle, S., Wu, C., Charlesworth, J. et al. Identifying lineage effects when controlling for population structure improves power in bacterial association studies. Nat Microbiol 1, 16041 (2016) doi:10.1038/nmicrobiol.2016.41
Falush, D. Bacterial genomics: Microbial GWAS coming of age. Nat Microbiol 1, 16059 (2016) doi:10.1038/nmicrobiol.2016.59
Farhat, M.R., Freschi, L., Calderon, R. et al. GWAS for quantitative resistance phenotypes in Mycobacterium tuberculosis reveals resistance genes and regulatory regions. Nat Commun 10, 2128 (2019) doi:10.1038/s41467-019-10110-6
Lees, J., Bentley, S. Bacterial GWAS: not just gilding the lily. Nat Rev Microbiol 14, 406 (2016) doi:10.1038/nrmicro.2016.82
Petkau A, Mabon P, Sieffert C, Knox N, Cabral J, Iskander M, Iskander M, Weedmark K, Zaheer R, Katz L, Nadon C, Reimer A, Taboada E, Beiko R, Hsiao W, Brinkman F, Graham M, Van Domselaar G. (2017) SNVPhyl: a single nucleotide variant phylogenomics pipeline for microbial genomic epidemiology. M Gen 3(6): doi:10.1099/mgen.0.000116.
Price, A., Patterson, N., Plenge, R. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904–909 (2006) doi:10.1038/ng1847
Saber, M. M. and Shapiro, J. (2019) Benchmarking bacterial genome-wide association study (GWAS) methods using simulated genomes and phenotypes. bioRxiv 795492; doi: https://doi.org/10.1101/795492 (pre-print)